Assessing ASR Model Quality on Disordered Speech using BERTScore
Word Error Rate (WER) is the primary metric used to assess automatic speech
recognition (ASR) model quality. It has been shown that ASR models tend to have
much higher WER on speakers with speech impairments than typical English
speakers. It is hard to determine whether models can be useful at such high error
rates. This study investigates the use of BERTScore, an evaluation metric for
text generation, to provide a more informative measure of ASR model quality and
usefulness. Both BERTScore and WER were compared to prediction errors manually
annotated by Speech Language Pathologists for error type and assessment.
BERTScore was found to correlate more strongly with these human judgments of
error type and assessment. In particular, BERTScore was more robust to orthographic changes
(contraction and normalization errors) where meaning was preserved.
Furthermore, BERTScore fit the error assessments better than WER, as
measured using an ordinal logistic regression and the Akaike Information
Criterion (AIC). Overall, our findings suggest that BERTScore can complement
WER when assessing ASR model performance from a practical perspective,
especially for accessibility applications where models are useful even at lower
accuracy than for typical speech.
Comment: Accepted to the Interspeech 2022 Workshop on Speech for Social Good.
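As a concrete illustration of the metric gap the study describes, the sketch below scores a hypothetical contraction error with both WER and BERTScore. It assumes the open-source jiwer and bert_score packages; the transcripts are invented and this is not the study's own pipeline:

```python
# Hedged sketch: compare WER and BERTScore on one hypothetical ASR pair.
# Assumes the open-source `jiwer` and `bert_score` packages are installed.
from jiwer import wer
from bert_score import score

reference = "i cannot make it to the appointment today"
# Invented ASR output with a contraction: WER counts a substitution,
# although the meaning is fully preserved.
hypothesis = "i can't make it to the appointment today"

word_error_rate = wer(reference, hypothesis)

# bert_score returns per-pair precision, recall, and F1 tensors.
precision, recall, f1 = score([hypothesis], [reference], lang="en")

print(f"WER: {word_error_rate:.3f}")      # 0.125: penalized despite same meaning
print(f"BERTScore F1: {f1.item():.3f}")   # close to 1.0 for preserved meaning
```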
Towards Agile Text Classifiers for Everyone
Text-based safety classifiers are widely used for content moderation and
increasingly to tune generative language model behavior - a topic of growing
concern for the safety of digital assistants and chatbots. However, different
policies require different classifiers, and safety policies themselves improve
from iteration and adaptation. This paper introduces and evaluates methods for
agile text classification, whereby classifiers are trained using small,
targeted datasets that can be quickly developed for a particular policy.
Experimenting with 7 datasets from three safety-related domains, comprising 15
annotation schemes, led to our key finding: prompt-tuning large language
models, like PaLM 62B, with a labeled dataset of as few as 80 examples can
achieve state-of-the-art performance. We argue that this enables a paradigm
shift for text classification, especially for models supporting safer online
discourse. Instead of collecting millions of examples to attempt to create
universal safety classifiers over months or years, classifiers could be tuned
using small datasets, created by individuals or small organizations, tailored
for specific use cases, and iterated on and adapted in the time span of a day.
Comment: Findings of EMNLP 2023.
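Prompt tuning PaLM 62B is not publicly reproducible, but the agile recipe itself can be sketched. The example below uses Hugging Face PEFT prompt tuning on an open encoder as a stand-in; the model name, label set, and dataset size are illustrative assumptions, not the paper's setup:

```python
# Hedged sketch of the agile-classifier recipe with prompt tuning.
# Assumes Hugging Face `transformers` and `peft`; "roberta-base", the
# two-label policy, and the tiny dataset are illustrative stand-ins.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PromptTuningConfig, TaskType, get_peft_model

model_name = "roberta-base"  # stand-in for a large LM such as PaLM 62B
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2  # e.g. "violates policy" vs. "allowed"
)

# Prompt tuning trains only a handful of virtual prompt embeddings and
# leaves the LM frozen, which is what makes retuning for a revised
# policy with ~80 fresh labels cheap enough to repeat within a day.
peft_config = PromptTuningConfig(task_type=TaskType.SEQ_CLS, num_virtual_tokens=20)
model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()  # a tiny fraction of the full model

# From here, train on the small policy-specific labeled set with a
# standard Trainer loop and re-run whenever the policy is revised.
```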
The CALBC Silver Standard Corpus for Biomedical Named Entities - A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers
The production of gold standard corpora is time-consuming and costly. We propose an alternative: the 'silver standard corpus' (SSC), a corpus generated by harmonising the annotations delivered by a selection of annotation systems. The systems have to share the type system for the annotations, and the harmonisation solution has to use a suitable similarity measure for the pairwise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630,324 sentences; 15,956,841 tokens). We can demonstrate that the annotation of proteins and genes shows higher diversity across all annotation solutions, leading to lower agreement against the harmonised set than the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generate an annotated corpus from automated annotation systems. Further research is required to understand how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.
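A minimal sketch of the harmonisation idea, not the CALBC pipeline itself: annotations from several taggers sharing a type system are kept in the silver standard when enough systems produce a sufficiently similar annotation, with character-span overlap standing in for the similarity measure. All taggers, spans, and thresholds below are invented for illustration:

```python
# Hedged sketch of agreement-based harmonisation (not the CALBC system).
def span_overlap(a, b):
    """Jaccard-style overlap of two (start, end) character spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union else 0.0

def harmonise(system_annotations, min_votes=3, threshold=0.6):
    """system_annotations: one set per tagger of (start, end, type)
    tuples over the same sentence, in a shared type system."""
    silver = set()
    for start, end, etype in {a for anns in system_annotations for a in anns}:
        votes = sum(
            any(t == etype and span_overlap((start, end), (s, e)) >= threshold
                for s, e, t in anns)
            for anns in system_annotations
        )
        if votes >= min_votes:
            silver.add((start, end, etype))  # near-duplicates not merged here
    return silver

# Invented output of four taggers on one sentence:
taggers = [
    {(0, 5, "GENE"), (20, 28, "DISEASE")},
    {(0, 5, "GENE"), (20, 28, "DISEASE")},
    {(1, 5, "GENE")},
    {(20, 28, "DISEASE")},
]
print(harmonise(taggers))  # annotations confirmed by >= 3 of the 4 systems
```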
Reducing class imbalance during active learning for named entity annotation
In many natural language processing tasks, the classes to be dealt with are heavily imbalanced in the underlying data set, and classifiers trained on such skewed data tend to exhibit poor performance on low-frequency classes. We introduce and compare different approaches to reduce class imbalance by design within the context of active learning (AL). Our goal is to compile more balanced data sets up front during annotation time when AL is used as a strategy to acquire training material. We situate our approach in the context of named entity recognition. Our experiments reveal that we can indeed reduce class imbalance and increase the performance of classifiers on minority classes while preserving good overall performance in terms of macro F-score.
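One way to make such a balance-aware AL design concrete is sketched below. It combines least-confidence uncertainty with an upweighting of examples the current model assigns to under-represented classes; this is an invented heuristic showing the general idea, not the paper's exact selection strategy:

```python
# Hedged sketch: class-balance-aware uncertainty sampling for AL.
# Invented heuristic for illustration, not the paper's exact approach.
import numpy as np
from collections import Counter

def balanced_uncertainty_sampling(probs, labeled_counts, batch_size=10):
    """probs: (n_unlabeled, n_classes) predicted class probabilities.
    labeled_counts: Counter mapping class -> labeled examples so far."""
    n_classes = probs.shape[1]
    uncertainty = 1.0 - probs.max(axis=1)      # least-confidence score
    predicted = probs.argmax(axis=1)
    total = sum(labeled_counts.values()) or 1
    # Upweight examples predicted as classes with few labeled instances.
    rarity = np.array([1.0 - labeled_counts.get(c, 0) / total
                       for c in range(n_classes)])
    scores = uncertainty * rarity[predicted]
    return np.argsort(scores)[::-1][:batch_size]  # indices to annotate next

# Invented pool of 100 unlabeled examples; class 1 is the minority.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=100)
print(balanced_uncertainty_sampling(probs, Counter({0: 90, 1: 5, 2: 30})))
```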
Classical Probabilistic Models and Conditional Random Fields
Klinger R, Tomanek K. Classical Probabilistic Models and Conditional Random Fields. Department of Computer Science, Dortmund University of Technology; 2007.